FLAT-LLM: Fine-grained Low-rank Activation Space Transformation for Large Language Model Compression

A Fast, Training-free Approach to Efficient LLM Deployment

Published on arXiv: May 29, 2025

Authors: J. Tian et al.
Link: http://arxiv.org/abs/2505.23966v1
Institutions: University of California, Santa Barbara • Intel Corporation
Keywords: large language models, model compression, low-rank decomposition, principal component analysis, structural pruning, attention mechanisms, importance ranking, inference speedup, activation space transformation, WikiText-2, Llama-2, Mistral-7B, SVD-LLM, SliceGPT


Recent advances in Large Language Models (LLMs) have propelled state-of-the-art performance in Natural Language Processing, yet these models are difficult to deploy in resource-limited settings due to their high computational and memory requirements. Model compression—through strategies such as quantization, knowledge distillation, pruning, and particularly low-rank decomposition—has emerged as an essential direction, though most existing methods incur accuracy degradation or produce inefficient compressed architectures.
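To make the low-rank decomposition idea concrete, here is a minimal, generic sketch of compressing a single weight matrix with a truncated SVD. This is not the FLAT-LLM algorithm itself (which operates in the activation space with fine-grained, head-wise transformations); it only illustrates the basic trade-off between parameter count and reconstruction error that such methods build on. Matrix sizes and the rank are toy values chosen for illustration.

```python
import numpy as np

# Toy dimensions: a dense (d_out x d_in) weight matrix compressed to rank r.
rng = np.random.default_rng(0)
d_out, d_in, rank = 256, 256, 32

W = rng.standard_normal((d_out, d_in))          # dense weight matrix
U, S, Vt = np.linalg.svd(W, full_matrices=False)

# Keep only the top-`rank` singular directions: W is approximated by A @ B.
A = U[:, :rank] * S[:rank]                      # shape (d_out, rank)
B = Vt[:rank, :]                                # shape (rank, d_in)

params_dense = W.size                           # d_out * d_in
params_lowrank = A.size + B.size                # rank * (d_out + d_in)
rel_err = np.linalg.norm(W - A @ B) / np.linalg.norm(W)

print(f"dense params:    {params_dense}")
print(f"low-rank params: {params_lowrank}")
print(f"relative error:  {rel_err:.3f}")
```

Replacing `W` with the factors `A` and `B` turns one matrix multiply into two cheaper ones whenever `rank * (d_out + d_in) < d_out * d_in`; the accuracy cost depends on how much of the spectrum the kept singular values capture, which is why methods in this family invest in choosing where and how aggressively to truncate.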

To address these challenges, the authors propose a novel solution. Their approach and main contributions are summarized as follows:

Building on these methodological innovations, the main findings and quantitative results include:

Based on these results, the authors conclude with the following key points: